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Solving the Kohn-Sham eigenvalue problem constitutes the most computationally expensive part 
in self-consistent density functional theory (DFT) calculations. In a previous paper, we have pro- 
posed a nonlinear Chebyshev-filtered subspace iteration method, which avoids computing explicit 
eigenvectors except at the first SCF iteration. The method may be viewed as an approach to solve 
the original nonlinear Kohn-Sham equation by a nonlinear subspace iteration technique, without 
emphasizing the intermediate linearized Kohn-Sham eigenvalue problems. It reaches self-consistency 
within a similar number of SCF iterations as eigensolver-based approaches. However, replacing the 
standard diagonalization at each SCF iteration by a Chebyshev subspace filtering step results in a 
significant speedup over methods based on standard dagonalization. Here, we discuss an approach 
for implementing this method in multi-processor, parallel environment. Numerical results are pre- 
sented to show that the method enables to perform a class of highly challenging DFT calculations 
that were not feasible before. 



I. INTRODUCTION 



Electronic structure calculations based on first princi- 
ples use a very successful combination of density func- 
tional theory (DFT) 0, Q and pseudopotential theory 
[1, 0, H, [(|. DFT reduces the original multi-electron 
Shrodingcr equation into an effective one-electron Kohn- 
Sham equation, where all non-classical electronic interac- 
tions are replaced by a functional of the charge density. 
The pseudopotential theory further simplifies the prob- 
lem by replacing the true atomic potential with an effec- 
tive "pseudopotential" that is smoother but takes into 
account the effect of core electrons. Combining pseu- 
dopotential with DFT greatly reduces the number of one- 
electron wave- functions to be computed. However, even 
with these simplifications, solving the final Kohn-Sham 
equation can still be computationally challenging, espe- 
cially when the systems being studied are complex or 
contain thousands of atoms. 

Several approaches have been employed in solving the 
Kohn-Sham equations. They can be classified in two ma- 
jor groups: basis-free or basis-dependent approaches, ac- 
cording to whether they use an explicit basis set for elec- 
tronic orbitals or not. Among the basis-dependent ap- 
proaches, plane- wave methods are frequently used in ap- 
plications of DFT to periodic systems @, [1] , whereas lo- 
calized basis sets are very popular in quantum-chemistry 
applications Special basis sets, which do not make 

use of pseudopotentials, have also been designed for all- 
electron DFT calculations. These basis sets include local- 



ized atomic orbitals, linearized augmented plane waves, 
muffin-tin orbitals, and projector-augmented waves. A 
survey of advantages and disadvantages of these explicit- 
basis methods can be found in [f|. Real-space meth- 
ods are basis-free, and they have gained ground in recent 
years EH, EH EH due in great part to their simplic- 
ity. One advantage of real-space methods is that they 
are quite easy to implement in parallel environment. A 
second advantage is that, in contrast with the plane- 
wave approach, they do not impose artificial periodic- 
ity in non-periodic systems. Third, the application of 
potentials onto electron wave- functions is performed di- 
rectly in real space. Although the Hamiltonian matri- 
ces with a real-space approach are typically larger than 
with plane waves, the Hamiltonians are highly sparse and 
never stored or computed explicitly. Only matrix-vector 
products that represent the application of the Hamilto- 
nians on wave-functions need to be computed. 

This article focusscs on effective techniques to handle 
the most computationally expensive part of DFT cal- 
culations, namely the self-consistent-field (SCF) itera- 
tion. We present details of a recently developed nonlinear 
Chebyshev-filtered subspace iteration (CheFSI) method. 
The sequential version of CheFSI is first proposed in [l4| . 
The parallel CheFSI is implemented in our own DFT 
package called PARSEC (Pseudopotential Algorithm for 
Real-Space Electronic Calculations) EH- Although 
CheFSI is described in the framework of real-space DFT, 
the subspace filtering method can be employed to other 
Self-Consistent Field iterations. This method takes ad- 
vantage of the fact that intermediate SCF iterations do 
not require accurate eigenvalues and eigenvectors of the 
Kohn-Sham equation. 
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The Standard SCF iteration framework is used in 
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CheFSI, and a self-consistent solution is sought, which 
means that CheFSI has the same accuracy as other stan- 
dard DFT approaches. One can view CheFSI as a tech- 
nique to directly tackle the original nonlinear Kohn-Sham 
eigenvalue problems by a form of nonlinear subspace it- 
eration, without emphasizing the intermediate linearized 
Kohn-Sham eigenvalue problems. In fact, within CheFSI, 
explicit eigenvectors are computed only at the first SCF 
iteration, in order to provide a suitable initial subspace. 
After the first SCF step, the explicit computation of 
eigenvectors at each SCF iteration is replaced by a sin- 
gle subspace filtering step. The method reaches self- 
consistency within a number of SCF iterations that is 
close to that of eigenvector-based approaches. However, 
since eigenvectors are not explicitly computed after the 
first step, a significant gain in execution time results 
when compared with methods based on explicit diago- 
nalization. When compared with calculations based on 
efficient eigenvalue packages such as ARPACK [15| and 
TRLan [r| [13, a tenfold or higher speed-up is usu- 
ally observed. CheFSI enabled us to perform a class of 
highly challenging DFT calculations, including clusters 
with over ten thousand atoms, which were not feasible 
before. 

This article begins with a summary of SCF for DFT 
calculations in Section [Til Details about the parallel im- 
plementation are included in Section [TTTJ The Chebyshcv 
subspace filtering algorithm is presented in Section IIV1 
and the block Chebyshev-Davidson algorithm for the ini- 
tial diagonalization is discussed in Section [V] The block 
Chebyshev-Davidson method [HI, [l9[ improves consider- 
ably the efficiency of the diagonalization at the first SCF 
iteration, compared with the thick-restart Lanczos (TR- 
Lan) method [lj| QjJ which was used in [l4|] . The paper 
ends with numerical results in Section fVTl and a few con- 
cluding remarks. 



II. EIGENVALUE PROBLEMS IN DFT SCF 
CALCULATIONS 



Within DFT, the multi-electron Schrodingcr equation 
is simplified as the following Kohn-Sham equation: 



The Hartrec and exchange-correlation potentials de- 
pend on the charge density p{r), which is defined as 



fi(r) = ^i(r), (1) 



where ^(r) is a wave function, Ei is a Kohn-Sham eigen- 
value, h is the Planck constant, and M is the electron 
mass. In practice we use atomic units, thus h~ = M = 1. 

The total potential Vtotai-, also referred to as the effec- 
tive potential, includes three terms, 

V tot ai(p(r),r) =V ion (r)+V H (p(r),r)+V x c(p(r),r), (2) 



p(r) = 2£|^(r)| 5 



(3) 



Here n occ is the number of occupied states, which is equal 
to half the number of valence electrons in the system. 
The factor of two comes from spin multiplicity. Equation 
(|3[) can be easily extended to situations where the highest 
occupied states have fractional occupancy or when there 
is an imbalance in the number of electrons for each spin 
component. 

The most computationally expensive step of DFT is in 
solving the Kohn-Sham equation [TJ Since Vtotai depends 
on the charge density p(r), which in turn depends on 
the wavefunctions ^i, this equation can be viewed as a 
nonlinear eigenvalue problem. The SCF iteration is a 
general technique used to solve this nonlinear eigenvalue 
problem. It starts with an initial guess of the charge 
density, then obtains the initial Vtotai and solves [TJ for 
^i(r)'s to update p(r) and Vtotai- Then [1] is solved again 
for the new ^i(r)'s and the process is iterated until Vtotai 
(and also the wave functions) becomes stationary. 

In general, most of the computational effort involved 
in DFT is spent solving equation [TJ For this reason, it is 
the goal of any DFT code to lessen the burden of solving 
[Tj in the SCF iteration. One possible avenue to achieve 
this is to use better diagonalization routines. However 
this approach is limited as most diagonalization software 
has now reached maturation. At the other extreme, one 
can attempt to avoid diagonalization altogether, and this 
leads to the body of work represented by linear-scaling or 
order-N methods (see e.g. [20|). This approach however 
has other limitations. In particular, the approximations 
involved rely heavily on some decay properties of the den- 
sity matrix in certain function bases. In particular, they 
will be difficult to implement in real-space discretizations. 
Our approach lies somewhere between these extremes. 
We take advantage of the fact that accurate eigenvectors 
are unnecessary at each SCF iteration, since Hamiltoni- 
ans are only approximate in the intermediate SCF steps, 
and we exploit the nonlinear nature of the problem. The 
main point of our algorithm, developed in [bH |, is that 
once we have a good starting point for the Hamiltonian, 
it suffices to filter each basis vector at each iteration. In 
the intermediate SCF steps, these vectors are no longer 
eigenvectors but together they represent a good basis of 
the desired invariant subspace. The parallel implemen- 
tation of the idea will be discussed in Section IIVI The 
next section summarizes parallel implementation issues 
in PARSEC. 



III. THE PARALLEL ENVIRONMENT IN 
PARSEC 



where Vi on is the ionic potential, Vh is the Hartree po- 
tential, and Vxc is the exchange-correlation potential. 



PARSEC uses pseudopotcntial real-space implemen- 
tation of DFT. The motivation and original ideas be- 



hind the method go back to the early 1990s [10|, 
Within PARSEC, an uniform Cartesian grid in real-space 
is placed on the region of interest, and the Kohn-Sham 
equation is discretized by a high order finite-difference 
method [2l[ on this grid. Wavefunctions are expressed 
as functions of grid positions. Outside a specified sphere 
boundary that encloses the physical system, wavefunc- 
tions are set to zero for non-periodic systems. In ad- 
dition to the advantages mentioned in the introduction, 
another advantage of the real-space approach is that pe- 
riodic boundary conditions are also reasonably simple to 
implement [22| . 

The latest version of PARSEC is written in Fortran 
95. PARSEC has now evolved into a mature, massively 
parallel package, which includes most of the functionality 
of com par able DFT codes [2j|. The reader is referred 
to [24l, |25| for details and the rationale of the parallel 
implementation. The following is a brief summary of the 
most important points. 

The parallel mode of PARSEC uses the standard Mes- 
sage Passing Interface (MPI) library for communication. 
Parallelization is achieved by partitioning the physical 
domain which can have various shapes depending on 
boundary conditions and symmetry operations. Figure 
[T] illustrates four cube-shaped neighboring sub-domains. 
For a generic, confined system without symmetry, the 
physical domain is a sphere which contains all atoms plus 
some additional space (due to derealization of electron 
charge). In recent years, PARSEC has been enhanced to 
take advantage of physical symmetry. If the system is in- 
variant upon certain symmetry operations, the physical 
domain is replaced with an irreducible wedge constructed 
according to those operations. For example, if the sys- 
tem has mirror symmetry on the xy plane, the irreducible 
wedge covers only one hemisphere, either above or below 
the mirror plane. For periodic systems, the physical do- 
main is the periodic cell, or an irreducible wedge of it if 
symmetry operations are present. In any circumstance, 
the physical domain is partitioned in compact regions, 
each assigned to one processor only. Good load balance 
is ensured by enforcing that the compact regions have 
approximately the same number of grid points. 

Once the physical domain is partitioned, the physical 
problem is mapped onto the processors in a data-parallel 
way: each processor is in charge of a block of rows of the 
Hamiltonian corresponding to the block of grid points as- 
signed to it. The eigenvector and potential vector arrays 
are row-wise distributed in the same fashion. The pro- 
gram only requires an index function indx(i,j,k) which 
returns the number of the processor in which the grid 
point (i,j,k) resides. 

Because the Hamiltonian matrix is never stored, we 
need an explicit reordering scheme which renumbers rows 
consecutively from one processor to the next one. For 
this purpose we use a list of pointers that gives for each 
processor, the row with which it starts. 

Since finite difference discrctizction is used, when per- 
forming an operation such as a matrix- vector product, 





FIG. 1: Sample decomposition of a physical domain in PAR- 
SEC. 



communication will be required between nearest neigh- 
bor processors. For communication we use two index ar- 
rays, one to count how many and which rows are needed 
from neighbors, the other to count the number of local 
rows needed by neighbors. 

With this design of decomposition and mapping, the 
data required by the program can be completely dis- 
tributed. Being able to distribute the memory require- 
ment is quite important in solving large problems on 
standard supercomputers. 

Parallelizing subspacc methods for the linearized eigen- 
value problems (obtained from a finite difference dis- 
cretization of eqn. [T]) becomes quite straightforward with 
the above mentioned decomposition and mapping. Note 
that the subspace basis vectors contain approximations 
to eigenvectors, therefore the rows of the basis vectors are 
distributed in the same way as the rows of the Hamilto- 
nian. All matrix-matrix products, matrix- vector prod- 
ucts, and vector updates (e.g., linear combinations of 
vectors), can be executed in parallel. 

Reduction operations, e.g., computing inner prod- 
ucts and making the result available in each processor, 
are efficiently handled by the MPI reduction function 
MPI_ALLREDUCE(). 



IV. THE NONLINEAR 
CHEBYSHEV-FILTERED SUBSPACE 
ITERATION 

The main idea of ChcFSI is to start with a good initial 
subspacc V corresponding to occupied states of the initial 
Hamiltonian. This initial V is usually obtained by a diag- 
onalization step. No diagonalizations are necessary after 
the first SCF step. Instead, the subspace from the pre- 
vious iteration is filtered by a low degree-m Chebyshev 
polynomial, p m (t), constructed for the current Hamilto- 
nian H . The polynomial differs at each SCF step since 
H changes. The goal of the filter is to make the subspace 
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spanned by p m (H)V approximate the eigcnsubspace cor- 
responding to the occupied states of the final H . At the 
intermediate SCF steps, the basis need not be an accu- 
rate eigenbasis since the intermediate Hamiltonians are 
not exact. The filtering is designed so that the result- 
ing sequence of subspaces will progressively approximate 
the desired eigcnsubspace of the final Hamiltonian when 
self-consistency is reached. 

Our approach exploits the well-known fast growth 
property outside the [—1,1] interval of the Chebyshev 
polynomial, this allows us to use low degree Chebyshev 
polynomials to achieve sufficient filtering. At each SCF 
step, only two parameters are required to construct an 
effective Chebyshev filter, namely, a lower bound and an 
upper bound of the higher portion of the spectrum of 
the current Hamiltonian H in which we want p m (t) to 
be small. We propose simple but efficient ways to obtain 
these bounds, very little additional cost is required for 
the bound estimates. 

After self-consistency is reached, the Chebyshev fil- 
tered subspace includes the eigensubspace corresponding 
to occupied states. Explicit eigenvectors can be readily 
obtained by a Rayleigh-Ritz refinement (26j (also called 
subspace rotation) step. 

We refer to [1J, [27( for more algorithmic details and 
a literature survey concerning application of Chebyshev 
polynomials in DFT calculations. 

The main structure of the CheFSl method is given in 
Algorithm lfV.fi ft is quite similar to that of the standard 
SCF iteration discussed in Section II. One major differ- 
ence is that the inner iteration for diagonalization at Step 
2 is now performed only at the first SCF step. There- 
after, diagonalization is replaced by a single Chebyshev- 
Filtcrcd Subspace step, denoted as CheFS in Algorithm 

USUI 

The upper bound at step 7.1 in Algorithm lIV.ll can be 
obtained by using an upper-bound-cstimator presented 
in [T3|. The Chebyshev-f ilter step in step 7.2 calls 
a subroutine which applies the Chebyshev filter to each 
of the columns of ^. If m is the degree of the polyno- 
mial, this operation amounts to computing the sequence 
of blocks Xk, k = 2, ■ • ■ , m as follows: 

X k+1 = -(if — d)X k - X fc _i, k = 1, 2, • ■ ■ , m - 1 

e 

starting with X Q = X\ = \{H - d)X . The returned 
filtered block is "J = X m . The scalars e and c are de- 
fined by e = (b up - b low )/2 and c = (b up + b low )/2. For 
simplicity we presented here an unsealed version of the 
filtering process. To prevent the X k blocks from over- 
flowing it is safer to scale them at each iteration. The 
scaling operation is inexpensive as it uses only values of 
the Chebyshev polynomial at the approximate smallest 
eigenvalue of the Hamiltonian. The reader is referred 
to [l4[ for details. For discussion of scaling related to 
Chebyshev filtering, we refer interested readers to [3 or 
a more detailed technical report [27[ . 

The parallel implementation of Algorithms IIV.1I is 



Algorithm IV. 1 CheFSl for SCF calculation: 

1. Start from an initial guess of p(r), get Vtotai{p{r),r). 

2. Solve [~iV 2 + V total (p(r),r)} *<(r) = E^ z (r) for 
#i(r), i = 1,2,..., a. 

3. Compute new charge density p(r) = 2^"=i c l^i( r )| 2 - 

4- Solve for new Hartree potential Vh from V 2 Vh(?") = 
— 4np(r). 

5. Update Vxc;getnew V to tai{p,r) = Vi on (r) + V H {p,r) + 
Vxc{p,r) with 

a potential-mixing step. 

6. If \\Vtotal - Vtotal\\ < tol, Stop. 

7. Vtotai <— Vtotai (update H implicitly); apply the following 
Chebyshev- Filtered Subspace ('CheFSJ method to get s 
approximate wavefunctions: 

7.1) Compute b up := upper bound of the spectrum of 
H 

Set blow := the largest Ritz value from previous 
iteration. 

7.2) Perform Chebyshev filtering to the matrix \I r , 
whose column-vectors 

are the s discretized wavefunctions of ^l/i( r '); * ~ 
1, ...,s: 

= Chebyshev_f ilter(\E', m, bi ow , b up ) , 

7.3) Ortho-normalize the basis ^ by iterated Gram- 
Schmidt. 

7.4) Perform the Rayleigh-Ritz (rotation) step: 

a) Compute H = * T Jf*; 

b) Compute the eigendecomposition of H: HQ = 
QD, 

where D contains non-increasingly ordered 
eigenvalues of H , 

and Q contains the corresponding eigenvectors; 

c) 'Rotate' the basis as <1/ := ^Q; return $ and 
D. 

8. Goto step 3. 
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straightforward with the parallel paradigm discussed in 
Section IIIII We only mention that the matrix-vector 
products related to filtering, computing upper bounds, 
and Raylcigh-Ritz refinement, can easily be executed in 
parallel. The re-orthogonalization at Step 7.3 of Algo- 
rithm IIV.1I uses a parallel version of the iterated Gram- 
Schmidt DGKS method [28j], which scales better than the 
standard modified Gram-Schmidt algorithm. 

The estimated complexity of the algorithm is similar to 
that of the sequential CheFSI method in [l4| . For parallel 
computation it suffices to estimate the complexity on a 
single processor. Assume that p processors are used, i.e., 
each processor shares N/p rows of the full Hamiltonian. 
The estimated cost of a CheFS step on each processor 
with respect to the dimension of the Hamiltonian denoted 
by N, and the number of computed states s, is as follows: 

• The Chcbyshev filtering in Step 7.2 costs 0(s*N/p) 
flops. The discretized Hamiltonian is sparse and 
each matrix-vector product on one processor costs 
0(N/p) flops. Step 7. 2 requires m * s matrix-vector 
products, at a total cost of 0(s * m * N/p) where 
the degree m of the polynomial is small (typically 
between 8 and 20). 

• The ortho-normalization in Step 7.3 costs 0(s 2 * 
N/p) flops. There are additional communication 
costs because of the global reductions. 

• The eigen-decomposition at Step 7.4 costs 0(s 3 ) 
flops. 

• The final basis refinement step := ^Q) costs 
0(s 2 * N/p). 

If a standard iterative diagonalization method is used 
to solve the eigenproblcm [1] at each SCF step, then it 
also requires (i) the orthonormalization of a (typically 
larger) basis; (ii) the eigen-decomposition of the pro- 
jected Rayleigh-quotient matrix; and (iii) the basis re- 
finement (rotation). These operations need to be per- 
formed several times within this single diagonalization. 
But CheFS performs each of these operations only once 
per SCF step. Therefore, although CheFS scales in a 
similar way to standard diagonalization-based methods, 
the scaling constant is much smaller. For large problems, 
CheFS can achieve a tenfold or more speedup per SCF 
step, over using the well-known efficient eigenvalue pack- 
ages such as ARPACK [H| and TRLan [IE[l7|]. The to- 
tal speedup can be more significant since self-consistency 
requires several SCF iteration steps. 

To summarize, a standard SCF method would have 
an outer SCF loop — the usual nonlinear SCF loop, and 
an inner diagonalization loop, which iterates until eigen- 
vectors are within specified accuracy. Algorithm IIV.II 
simplifies this by merging the inner-outer loops into a 
single outer loop, which can be considered as a nonlinear 
subspace iteration algorithm. The inner diagonalization 
loop is reduced into a single Chcbyshev subspace filtering 
step. 



V. CHEBYSHEV-DAVIDSON ALGORITHM 
FOR THE FIRST SCF ITERATION 

Within CheFSI, the most expensive SCF step is the 
first one, as it involves a diagonalization in order to com- 
pute a good initial subspace to be used for latter filter- 
ing. In principle, any effective eigenvalue algorithms can 
be used. PARSEC originally had three diagonalization 
methods: Diagla, which is a preconditioned Davidson 
method [HI Hi!; the symmetric eigensolver in ARPACK 
[HI, Hil ! an d the Thick- Restart Lanczos algorithm called 
TRLan [r| G3- For systems of moderate sizes, Diagla 
works well, and then becomes less competitive relative to 
ARPACK or TRLan for larger systems when a large num- 
ber of eigenvalues are required. TRLan is about twice as 
fast as the symmetric eigensolver in ARPACK, because 
of its reduced need for re-orthogonalization. In [l4| , TR- 
Lan was used for the diagonalization at the first SCF 
step. 

For very large systems, memory can become a se- 
vere constraint. One has to use eigenvalue algorithms 
with restart since out-of-core operations can be too slow. 
However, even with standard restart methods such as 
ARPACK and TRLan, the memory demand can still sur- 
pass the capacity of some supercomputers. For example, 
the Si 

904i-Hi860 cluster by TRLan or ARPACK would 
require more memory than the largest memory allowed 
for a job at the Minnesota Supercomputing Institute in 
2006. Hence it is important to develop a diagonaliza- 
tion method that is less memory demanding but whose 
efficiency is comparable to ARPACK and TRLan. The 
Chebyshev-Davidson method [l||, [l9j] is developed with 
these two goals in mind. 

It is generally accepted that for the implicit filtering 
in ARPACK and TRLan to work efficiently, one needs to 
use a subspace with dimension about twice the number 
of wanted eigenvalues. This leads to a relatively large 
demand in memory when the number of wanted eigen- 
values is lar ge. The block Chebyshev-Davidson method 
discussed in [I9( introduced an inner-outer restart tech- 
nique. The outer restart corresponds to a standard 
restart in which the subspace is truncated to a smaller 
dimension when the specified maximum subspace dimen- 
sion is reached. The inner restart corresponds to a stan- 
dard restart restricted to an active subspace, it is per- 
formed when the active subspace dimension exceeds a 
given integer act max which is much smaller than the 
specified maximum subspace dimension. With inner- 
outer restart, the subspace used in Chebyshev-Davidson 
is about half the dimension of the subspace required by 
ARPACK or TRLan. 

We adapted the proposed Chebyshev filters into a 
Davidson-type eigenvalue algorithm. Although no Ritz 
values are available from previous SCF steps to be used as 
lower bounds, the Rayleigh-Ritz refinement step within a 
Davidson-type method can easily provide a suitable lower 
bound at each iteration. The upper bound is again es- 
timated by the upper-bound-estimator in [l4| . and it is 
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computed only once. These two bounds are sufficient for 
constructing a filter at each Chebyshev-Davidson itera- 
tion. The constructed filter magnifies the wanted lower 
end of the spectrum and dampens the unwanted higher 
end, therefore the filtered block of vectors have strong 
components in the wanted eigensubspace, which results 
in an efficiency that is comparable to that of ARPACK or 
TRLan. The main structure of this Chebyshev-Davidson 
method is sketched in Algorithm lV.il we refer interested 
readers to for algorithmic details. 

Algorithm V.l Structure outline of the block Chebyshev- 
Davidson method 

1. Compute b up using the upper- bound- estimator in 
fl2 l: set bi ow as the median of the 
eigenvalues of the tri-diagonal matrix from the upper- 
bound-estimator. 

Make the given initial size-k block Vi orthonormal, set 
V=[V 1 ]. 

[ Vf ] — Chebyshev_f ilter(Vi, m, how, b up ). 

3. Augment the basis VbyVj: V ■*— [ V, Vf ], make V 
orthonormal. 

4- Inner-restart if active subspace dimension exceeds a 
given integer actmax- 

5. Rayleigh-Ritz refinement: update matrix M s.t. M — 
V T HV; 

do eigendecomposition of M : M Y = YD; updated 
basis V: V <— V Y. 

6. Compute residual vectors, determine convergence; 
perform deflation if some eigenpairs converge. 

7. If all wanted eigenpairs converged, stop; else, adapt 
bi ow = max(diag(D)) , 

set Vi = [ the first k non-converged Ritz vectors in V 
]• 

8. Outer-restart if size of V exceeds maximum subspace 
dimension. 

9. Continue from step 2. 



The first step diagonalization by the block Chebyshev- 
Davidson method, together with the Chebyshev-filtered 
subspace (ChcFS) method, enabled us to perform SCF 
calculations for a class of large systems, including the sili- 
con cluster 5*19041 -ff i860 f° r which over 19000 eigenvectors 
of a Hamiltonian with dimension around 3 million were 
to be computed. These systems are practically infeasible 
with the other three eigensolvers (ARPACK, TRLan and 
Diagla) in PARSEC, using the current supercomputer re- 
sources available to us at the Minnesota Supercomputing 
Institute (MSI). 



VI. NUMERICAL RESULTS 

PARSEC has been applied to study a wide range of 
material systems (e.g. [lj], [H, HH). The focus of this 
section is on large systems where relatively few numeri- 
cal results exist because of the infeasibility of eigenvector- 
based methods. We mention that (30l ] contains very in- 
teresting studies on clusters containing up to 1100 silicon 
atoms, using the well-known efficient plane- wave DFT 
package VASP [1, Hl[; however, it is stated in [3(J that a 
cluster with 1201 silicon atoms is "too computationally 
intensive". As a comparison, PARSEC using CheFSI, 
together with the currently developed symmetric oper- 
ations of real-space pseudopotential methods [32| . can 
now routinely solve silicon clusters with several thousand 
atoms. 

The hardware used for the computations is the SGI Al- 
tix cluster at MSI, it consists of 256 Intel Itanium proces- 
sors at CPU rates of 1.6 GHz, sharing 512 GB of memory 
(but a single job is allowed to request at most 250 GB 
memory) . 

The goal of the computations is not to study the par- 
allel scalability of PARSEC, but rather to use PARSEC 
to do SCF calculation for large systems that were not 
studied before. Therefore we do not use different pro- 
cessor numbers to solve the same problem. Scalability is 
studied in (25| for the preconditioned Davidson method. 
Here we mention that the scalability of CheFS is better 
than eigenvector-based methods because of the reduced 
reorthogonalizations . 

In the reported numerical results, the total_eV/atom 
is the total energy per atom in electron-volts, this value 
can be used to assess accuracy of the final result; the #SCF 
is the iteration steps needed to reach self-consistency; and 
the #MVp counts the number of matrix-vector products. 
Clearly #MVp is not the only factor that determines CPU 
time, the orthogonalization cost can also be a significant 
component. 

For all of the reported results for CheFSI, the first step 
diagonalization used the Chebyshev-Davidson method 
(Algorithm |VT|. In Table [fl the 1st CPU denotes the 
CPU time spent on the first step diagonalization by 
Chebyshev-Davidson; the total CPU counts the total 
CPU time spent to reach self-consistency by CheFSI. 

The first example in Table U is a relatively small silicon 
cluster 51525^276: which is used to compare the perfor- 
mance of CheFSI with two eigenvector-based methods. 
All methods use the same symmetry operations 32j in 
PARSEC. 

For larger clusters 5i27i3-ffs28 and 5i4ooi-Hioi2 , Diagla 
became too slow to be practical. However, we could still 
apply TRLan for the first step diagonalization for com- 
parison, but we did not iterate until self-consistency was 
reached since that would cost a significant amount of 
our CPU quota. Note that with the problem size in- 
creasing, Chebyshev-Davidson compares more favorably 
over TRLan. This is because we employed an additional 
trick in Chebyshev-Davidson, which corresponds to al- 
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sysem 


dim. of H 


Testate 


#MVp 


#SCF 


total_eV/ atom 


1st CPU 


total CPU 


Si27i3Hs2s a 


1074080 


5843 


1400187 


14 


-86.16790 


7.83 hrs. 


19.56 hrs. 


S , l4001-ffl012 6 


1472440 


8511 


1652243 


12 


-89.12338 


18.63 hrs. 


38.17 hrs. 


<Sl6047-Hl308 C 


2144432 


12751 


2682749 


14 


-91.34809 


45.11 hrs. 


101.02 hrs. 


519041 ^1860 d 


2992832 


19015 


4804488 


18 


-92.00412 


102.12 hrs. 


294.36 hrs 




2790688 


1812 x 2 


9377435 


110 


-795.18064 


16.16 hrs. 


112.44 hrs. 


Fe 3 26 1 


2985992 


1956 x 2 


10241385 


119 


-795.19898 


11.62 hrs. 


93.15 hrs. 


Fe 3 eo 9 


3262312 


2160 x 2 


12989799 


146 


-795.22329 


16.55 hrs. 


140.68 hrs. 



a m = 10 for CheFS. First step diagonalization by TRLan cost 8.65 hours, projecting it into a 14-steps SCF iteration cost around 121.1 
hours. 

b First step diagonalization by TRLan cost 34.99 hours, projecting it into a 12-steps SCF iteration cost around 419.88 hours. 
c Using 32 processors. 
d Using 48 processors. 

E m = 20 for Chebyshev-Davidson; m = 19 for CheFS. 

* using 24 processors, m = 20 for Chebyshev-Davidson; m = 19 for CheFS. 
9 using 24 processors, m = 20 for Chebyshev-Davidson; m = 17 for CheFS. 

TABLE I: Performance of the CheFSI method in various test systems. All calculations were performed using 16 processors, 
and polynomial degrees m = 17 for the Chebyshev-Davidson and m = 8 for CheFSI, except when otherwise stated. 



method #MVp #SCF steps total_eV/atom CPU(secs) 
CheFSI 189755 11 -77.316873 542.43 

TRLan 149418 10 -77.316873 2755.49 

Diagla 493612 10 -77.316873 8751.24 

TABLE II: £1525 -#276 > using 16 processors. The Hamilto- 
nian dimension is 292584, where 1194 states need to be com- 
puted at each SCF step. The first step diagonalization by 
Chebyshev-Davidson cost 79755 #MVp and 221.05 CPU sec- 
onds; so the total #MVp spent on CheFS in CheFSI is 110000. 
The polynomial degree used is m = 17 for Chebyshev- 
Davidson and m = 8 for CheFS. The fist step diagonalization 
by TRLan requires 14909 #MVp and 265.75 CPU seconds. 



lowing the last few eigenvectors not to converge to the re- 
quired accuracy. The number of the non fully converged 
eigenvectors is bounded above by act max , which is the 
maximum dimension of the active subspace. Typically 
30 < act max < 300 for Hamiltonian size over a million 
where several thousand eigenvectors are to be computed. 
The implementation of this trick is rather straightfor- 
ward since it corresponds to applying the CheFS method 
to the subspace spanned by the last few vectors in the 
basis that have not converged to required accuracy. 

For even larger clusters S«6047-ffi308 an d 5*904i-Hi860j 
it became impractical to apply TRLan for the first step 
diagonalization because of too large memory require- 
ments. For these large systems, using an eigenvector- 
based method for each SCF step is clearly not feasible. 



We note that the cost for the first step diagonalization by 
Chebyshev-Davidson is still rather high, it took close to 
50% of the total CPU. In comparison, the CheFS method 
saves a significant amount of CPU for SCF calculations 
over diagonalization-based methods, even if very efficient 
eigenvalue algorithms are used. 

Once the DFT problem, Eq. |T]), is solved, we have 
access to several physical quantities. One of them is the 
ionization potential (IP) of the nanocrystal, defined as 
the energy required to remove one electron from the sys- 
tem. Numerically, we use a ASCF method: perform two 
separate calculations, one for the neutral cluster and an- 
other for the ionized one, and observe the variation in 
total energy between these calculations. Figure [H shows 
the IP of several clusters, ranging from the smallest pos- 
sible {SiHi) to S , i904ii/i860- For comparison, we also 
show the eigenvalue of the highest occupied Kohn-Sham 
orbital, Ehomo ■ A known fact of DFT-LDA is that the 
negative of the Ehomo energy is lower than the IP in 
clusters @, which is confirmed in Figure [2] In addition, 
the figure shows that the IP and —Ehomo approach each 
other in the limit of extremely large clusters. 

Figure [5] also shows the electron affinity (EA) of the 
various clusters. The EA is defined as the energy released 
by the system when one electron is added to it. Again, we 
calculate it by performing SCF calculations for the neu- 
tral and the ionized systems (negatively charged instead 
of positively charged now). In PARSEC, this sequence 
of SCF calculations can be done very easily by reusing 
previous information: The initial diagonalization in the 
second SCF calculation is waived if we reuse eigenvec- 
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FIG. 2: Ionization potential (IP, crosses) and electron affinity 
(EA, "plus" signs) for various clusters with diameters rang- 
ing from nm (SHI4) to 7 nm (SigoAiHiseo)' Squares de- 
note the negative of the highest occupied eigenvalue energy 
{—Ehomo) of the neutral cluster. Diamonds denote the neg- 
ative of the lowest unoccupied eigenvalue energy {—Elumo)- 



tors and eigenvalues from a previous calculation as initial 
guesses for the ChebFSI method. Figure [2] shows that, as 
the cluster grows in size, the EA approaches the negative 
of the lowest-unoccupied eigenvalue energy. A power-law 
analysis in Figure [2] indicates that both the ionization 
potential and the electron affinity approach their bulk 
values according to a power-law decay R n with exponent 
close to 1. The numerical fits are: 



and -3 eV. Because of the discreteness of eigenvalues in 
clusters, the DOS is calculated by adding up normalized 
Gaussian distributions located at each calculated energy 
eigenvalue. In Figure [31 we used Gaussian functions with 
dispersion of 0.05 eV. More details are discussed in (34[. 
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FIG. 3: Density of states (DOS) of the cluster Sigou-ffiseo 
(upper panel) compared with periodic crystalline silicon 
(lower panel). As a consequence of the large size, the DOS 
of the SisoAiHiseo cluster is very close to that of bulk silicon 
(the infinite-size limit). 



IP = IPq + A/D a (4) 



EA = EA - B/D^ (5) 

with IP = 4.50 eV, EA = 3.87 eV, a = 1.16, = 1.09, 
A = 3.21 eV, B = 3.13 eV. These values for A and B 
assume a cluster diameter D given in nanometers. The 
difference between ionization potential and electron affin- 
ity is the electronic gap of the nanocrystal. As expected, 
the value of the gap extrapolated to bulk, IPo — EAo = 
0.63 eV, is very close to the energy gap predicted in var- 
ious DFT calculations for silicon, which range from 0.6 
eV to 0.7 eV 0, HH]. Owing to the slow power-law decay, 
the gap at the largest crystal studied is still 0.7 eV larger 
than the extrapolated value. 

Other properties of large silicon clusters are also ex- 
pected to be similar to the ones of bulk silicon, which is 
equivalent to a nanocrystal of "infinite size" . Figure [3] 
shows that the density of states already assumes a bulk- 
like profile in clusters with around ten thousand atoms. 
The presence of hydrogen atoms on the surface is re- 
sponsible for subtle features in the DOS at around -8 eV 



We also applied PARSEC to some large iron clusters. 
Extensive analysis of the magnetic properties of iron clus- 
ters based on the methodology presented here and in pre- 
vious work[3|, has provided decisive evidence for sur- 
face effects in the magnetic moment of these systems 
[35| , confirming earlier experimental data. Table U also 
contains three clusters with more than 300 iron atoms. 
These metallic systems are well-known to be very difficult 
for DFT calculations, because of the "charge sloshing" 
@, HI- The LDA approximation used to get exchange- 
correlation potential Vxc is a l so known not to work well 
for iron atoms. However, PARSEC was able to reach 
self-consistency for these large metallic clusters within 
reasonable time length. It took more than 100 SCF 
steps to reach self-consistency, which is generally con- 
sidered too high for SCF calculations, but we observed 
(from calculations performed on smaller iron clusters) 
that eigenvector-based methods also required a similar 
number of SCF steps to converge, thus the slow conver- 
gence is associated with the difficulty of DFT for metallic 
systems. Without CheFS, and under the same hardware 
conditions as listed in Table HI over 100 SCF steps using 
eigenvector-based methods would have required months 
to complete for each of these clusters. 
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VII. CONCLUDING REMARKS 

We developed and implemented the parallel CheFSI 
method for DFT SCF calculations. Within CheFSI, 
only the first SCF step requires a true diagonaliza- 
tion, and we perform this step by the block Chebyshcv- 
Davidson method. No diagonalization is required af- 
ter the first step; instead, Chcbyshev filters are adap- 
tively constructed to filter the subspace from previous 
SCF steps so that the filtered subspace progressively ap- 
proximates the eigensubspace corresponding to occupied 
states of the final Hamiltonian. The method can be 
viewed as a nonlinear subspace iteration method which 
combines the SCF iteration and diagonalization, with the 
diagonalization simplified into a single step Chebyshcv 
subspace filtering. 

Additional tests not reported here, have also shown 
that the subspace filtering method is robust with respect 
to the initial subspace. Besides self-consistency, it can be 
used together with molecular dynamics or structural op- 
timization, provided that atoms move by a small amount. 
Even after atomic displacements of a fraction of the Bohr 
radius, the CheFSI method was able to bring the initial 
subspace to the subspace of self-consistent Kohn-Sham 
eigenvectors for the current position of atoms, with no 



substantial increase in the number of self-consistent cy- 
cles needed. 

CheFSI significantly accelerates the SCF calculations, 
and this enabled us to perform a class of large DFT cal- 
culations that were not feasible before by eigenvector- 
based methods. As an example of physical applications, 
we discuss the energetics of silicon clusters containing up 
to several thousand atoms. 
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